A Comparison of String Distance Metrics for Name-Matching Tasks
Abstract
Using an open-source Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme, which was developed in the probabilistic record linkage community.
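As an illustration of the Jaro-Winkler component named in the abstract, the following is a minimal Python sketch of the commonly cited formulation. It is not the toolkit's Java code, and the function names, the prefix scale of 0.1, and the prefix cap of 4 are assumptions taken from the usual textbook description rather than from this record.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (
        matches / len(s)
        + matches / len(t)
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s: str, t: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix.

    prefix_scale and max_prefix are assumed defaults, not values from the paper.
    """
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

For example, `jaro_winkler("MARTHA", "MARHTA")` is about 0.961: the two strings share all six characters with one transposition and a three-character common prefix. In the hybrid scheme the abstract describes, a character-level score of this kind would be combined with TFIDF token weights; that combination is not shown here.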
Similar Resources
Usability of String Distance Metrics for Name Matching Tasks in Polish
This paper presents the results of numerous experiments on the usability of well-established string distance metrics, and some new variants thereof, for various name-matching tasks in Polish.
A Comparison of String Metrics for Matching Names and Records
We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We ...
Chronic Treatment by L-NAME differently Affects Morris Water Maze Tasks in Ovariectomized and Naïve Female Rats
Introduction: The role of ovarian hormones and nitric oxide (NO) in learning and memory and their interaction has been widely investigated. The present study carried out to evaluate different effect of L-NAME on spatial learning and memory of ovariectomized (OVX) and sham operated rats. Methods: 32 rats were divided into 4 groups: 1) Sham 2) OVX 3) Sham-LN and 4) OVX-LN. The animals of groups 3...
Real World Performance of Approximate String Comparators for use in Patient Matching
Medical record linkage is becoming increasingly important as clinical data is distributed across independent sources. To improve linkage accuracy we studied different name comparison methods that establish agreement or disagreement between corresponding names. In addition to exact raw name matching and exact phonetic name matching, we tested three approximate string comparators. The approximate...
A Comparison of String Distance Metrics on Usernames for Cross-Platform Identification
People often use similar usernames across different social media sites. This fact can be used to correlate accounts between different platforms. Since the first mention of this fact in 2009 no research has been done on how to exploit it most efficiently. We showed that ignoring the casing will most definitely improve the matching and we found that Smith-Waterman provides the best metric to matc...
Publication date: 2003